ABSTRACT

The next generation of wide-area sky surveys offers the power
to place extremely precise constraints on cosmological parameters
and to test the source of cosmic acceleration. These
observational programs will employ multiple techniques based 
on a variety of statistical signatures of galaxies and large- 
scale structure. These techniques have sources of systematic 
error that need to be understood at the percent-level in or- 
der to fully leverage the power of next-generation catalogs. 
Simulations of large-scale structure provide the means to 
characterize these uncertainties. We are using XSEDE re- 
sources to produce multiple synthetic sky surveys of galaxies 
and large-scale structure in support of science analysis for 
the Dark Energy Survey. In order to scale up our production
to the level of fifty 10^10-particle simulations, we are working
to embed production control within the Apache Airavata
workflow environment. We explain our methods and report
how the workflow has reduced production time by 40% compared
to manual management.

Categories and Subject Descriptors 

D.2 [Software Engineering]: Programming Environments - Airavata;
D.2.11 [Software Architectures]: Domain-specific architectures

General Terms 

Science 
1. INTRODUCTION

A decade, and a Nobel Prize^1, after its discovery, the nature
of cosmic acceleration remains a mystery. Evidence continues
to favor the simplest form of dark energy, so-called ΛCDM
models with a constant vacuum energy density (or cosmological
constant) having a fixed ratio of pressure to energy density
(equation of state parameter), w = p/ρ = -1. Testing for
departures from this canonical model requires large, sensitive
astronomical surveys capable of delivering percent-level
statistical constraints on w and the dark energy density,
Ω_DE [4]. Departures from ΛCDM expectations may signal a
time-varying equation of state parameter anticipated by
specific theoretical models [Ts] or may indicate that gravity
departs from general relativity on large scales [20, 8].

^1 http://www.nobelprize.org/nobel_prizes/physics/

Realizing the full statistical power of upcoming surveys re- 
quires addressing all potential sources of systematic error 
associated with applying tests of cosmic acceleration based 
on the large-scale distribution of galaxies and clusters of 
galaxies. We are performing a suite of simulations that will 
allow us to address a range of sources of systematic error for 
the upcoming Dark Energy Survey (DES)^2. The simulations
can also be used to improve theoretical calibration of the
clustered matter distribution, including the abundance and
clustering of massive halos^3 that host galactic systems. The
particular set of simulations we are performing on XSEDE 
in 2012 will form the basis of a Blind Cosmology Challenge 
for the DES collaboration. 



2. COSMOLOGICAL SIMULATIONS FOR DES

The DES is a Stage III^4 dark energy project jointly
sponsored by DoE and NSF that is on track to see first light
in the fall of 2012. The project will use a new panoramic
camera on the Blanco 4-m telescope at the Cerro Tololo
Inter-American Observatory in Chile to image ~5000 square
degrees of the sky in the South Galactic Cap in four optical
bands, and to carry out repeat imaging over a smaller area
to identify distant type Ia supernovae and measure their
lightcurves. In addition, the main imaging area of the DES
overlaps the South Pole Telescope^5 sub-mm survey that will
identify galaxy clusters via the Sunyaev-Zel'dovich effect as
well as the VISTA^6 infrared survey of galaxies, which will
^2 http://www.darkenergysurvey.org

^3 The term halos refers to self-bound, quasi-equilibrium
structures that emerge via gravitational collapse of initial
density peaks.

^4 In the language of the Dark Energy Task Force, see [i].
^5 http://pole.uchicago.edu/
^6 http://www.vista.ac.uk



provide additional information on galaxy photometric redshifts
and on the properties of galaxy clusters at large cosmological
redshift^7, z > 1. Roughly three hundred scientists
across nearly thirty institutions comprise the DES collaboration.

The DES will be the first project to combine four different 
methods to probe the properties of the dark sector (dark 
matter and dark energy) and test General Relativity
via evolution of the Hubble expansion parameter and the linear
growth rate of structure. The methods (baryon acoustic
oscillations in the matter power spectrum, the abundance
and spatial distribution of galaxy groups and clusters, weak
gravitational lensing by large-scale structure, and type Ia
supernovae) are quasi-independent. Each has sources of
systematic error associated with it, some of which are unique 
to the method but many of which are shared. Examples 
of the latter are the accuracy of photometric redshift estimates^8,
the form of the non-linear matter clustering power
spectrum, and shape measurement errors for galaxy images 
that affect cosmic shear and galaxy cluster mass estimates. 
DES will thus be the first survey to address joint system- 
atics in multiple methods probing accelerating expansion of 
the universe. N-body simulations provide key support for 
the analysis of systematics in the three methods associated 
with cosmic large-scale structure (all but supernovae in the 
above list). To validate science analysis codes, the DES Sim- 
ulation Working Group is coordinating a Blind Cosmology 
Challenge (BCC) process, in which a variety of sky realiza- 
tions in different cosmologies will be analyzed, in a blind 
manner, by DES science teams. 

2.1 Blind Cosmology Challenge 

The Blind Cosmology Challenge (BCC) process will require 
generating multiple galaxy catalogs to the full photometric 
depth across the full 5000 square degrees of the DES survey. 
The effort will require roughly 6M SUs and generate 300 TB 
of output. 

Competing effects drive our simulation requirements. On 
one hand, the dark matter distribution needs to be modeled 
within a large cosmic volume. On the other hand, galaxy 
surveys also sample the nearby population of dwarf galaxies, 
implying a need for high spatial and mass resolution. We 
have developed an approach that generates a set of discrete 
N-body sky realizations of dark matter structure spanning 
a range in resolution and volume. We then dress this dark 
matter distribution with galaxies brighter than the DES lim- 
iting magnitudes in each passband. 

To implement the BCC, we plan to produce fifty 10^10-particle
N-body runs on XSEDE resources over the next two years.
The collaboration has requested that BCC models explore
a variety of cosmologies, parameters of which are known
only to the Simulation Working Group members. After a
blind processing period, constraints on cosmological parameters
from the science teams will be compared against their
true values, gauging the validity of the processing pipelines.
We detail ongoing and proposed simulations along with our
strategy for producing synthetic sky surveys of sufficient
area and depth in the next section.

^7 Redshift, derived from spectroscopy, measures both distance
and look-back time to the source. A galaxy at z = 1
emitted its light when the universe was 6.2 Gyr old and it
lies at a distance of 3.3 Giga-parsecs from the Milky Way.
^8 Photometric redshifts are distance estimates that use the
multi-color fluxes measured in broad optical-IR bands as
essentially a (very) low-resolution galaxy spectrum.

2.2 Computational requirements 

The dark matter structure from N-body simulations of a 
given cosmology forms the basis for galaxy catalog expectations.
The N-body models we run on XSEDE resources store
particle configurations, {x_i(t), v_i(t)}, in two forms: snapshots
of the positions and velocities of all particles i in the
simulation volume at a fixed time t, and lightcones [12] that
hold kinematic information for particles lying on the past
light cone of a virtual observer located at a fixed position,
x_cen, in the computational volume^9. Once the N-body steps
are completed, we proceed with additional processing using
a combination of resources, including XSEDE, SLAC and
other collaboration institutions. The post-processing aims
to create a synthetic DES catalog of galaxy properties derived
from the N-body lightcone output. Figure 1 shows a
schematic representation of the post-processing workflow.
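
The lightcone criterion can be illustrated with a minimal sketch. It assumes a flat ΛCDM background with illustrative values Ω_m = 0.3 and h = 0.7 (not taken from this paper), and it simply selects particles in a single snapshot whose distance from the observer falls in the comoving shell between two redshifts. The function names are hypothetical, and a production code would instead interpolate particle trajectories between timesteps.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance(z, omega_m=0.3, h=0.7, steps=2000):
    """Comoving distance in Mpc to redshift z in flat LCDM (trapezoid rule).

    omega_m and h are illustrative assumptions, not the paper's values.
    """
    hubble_dist = C_KM_S / (100.0 * h)  # c / H0 in Mpc

    def inv_e(zz):
        return 1.0 / math.sqrt(omega_m * (1.0 + zz) ** 3 + (1.0 - omega_m))

    dz = z / steps
    total = 0.5 * (inv_e(0.0) + inv_e(z))
    for k in range(1, steps):
        total += inv_e(k * dz)
    return hubble_dist * total * dz


def on_lightcone_shell(positions, x_cen, z_near, z_far, **cosmo):
    """Indices of particles with r(z_near) <= |x - x_cen| < r(z_far)."""
    r_near = comoving_distance(z_near, **cosmo)
    r_far = comoving_distance(z_far, **cosmo)
    return [i for i, x in enumerate(positions)
            if r_near <= math.dist(x, x_cen) < r_far]


# Three toy particles (positions in Mpc) around an observer at the origin.
positions = [(100.0, 0.0, 0.0), (3300.0, 0.0, 0.0), (0.0, 3250.0, 0.0)]
selected = on_lightcone_shell(positions, (0.0, 0.0, 0.0), 0.95, 1.05)
print(selected)
```

With these assumed parameters the comoving distance to z = 1 comes out near 3.3 Gpc, so only the two distant toy particles fall in the shell.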




Figure 1: Processing steps to build a synthetic galaxy catalog
are illustrated here and described in the text. The XBaya
workflow currently controls the top-most element (N-body
simulations), which consists of methods to sample a cosmological
power spectrum (ps), generate an initial set of particles
(ic), and evolve the particles forward in time with
Gadget (N-body). The remaining methods are run manually
on distributed resources.

Halos, bound systems that host galaxies and clusters of
galaxies, are identified in these outputs and their properties
can be used to determine their central galaxy characteristics.
Halos, as well as local density estimates in under-resolved
locations, are used by the ADDGALS^10 algorithm
to assign galaxy properties to suitably selected dark matter
particles. The matter along the past lightcone also sets the
gravitational lensing shear signal applied to these galaxies,
and we are developing a new multi-grid, spherical harmonic
algorithm for this computation. Finally, galaxy catalogs
that include lensing shear signals are processed with
telescope/instrument/noise effects to produce images expected
from the back-end electronics of the Dark Energy Camera.
From these images, which are generated for only 200 sq deg
of sky, we can develop effective transfer functions to create
full 5000 sq deg synthetic DES catalogs that contain realistic
errors such as star/galaxy mis-classification, blended
sources, and appropriate photometric errors.

^9 These particles satisfy the combined space-time requirement,
|x_i(t_i) - x_cen| = r_cosm(t_i), where r_cosm(t) is the cosmic
metric distance as a function of time, a known function of
the cosmological parameters.

Generating a cosmological N-body simulation has three main 
steps, described further below. The first two create an ini- 
tial particle set consistent with the structure expected at an 
early time of a chosen cosmological model. The first step 
samples the linear perturbation spectrum of the cosmolog- 
ical model while the second step realizes that density and 
velocity field with a set of particles. For the latter, we use
a second-order Lagrangian perturbation theory code (2LPTic)
that has been robustly tested in the community
[10, 9]. The amount of CPU time required to make these
initial conditions is small (typically 300-400 CPU hours for
simulations of the scale described here), but the memory
requirements are more substantial (slightly less than 2 GB per
core), although easily achievable on XSEDE machines.

The final step evolves this particle distribution under its
self-gravity within an expanding cosmological background.
For this purpose, we are using a lean version of the Gadget
cosmological N-body code [22, 23], modified by us to generate
output along the past lightcone of synthetic observers
in the computational volume. We have worked on XSEDE
resources for the past year to modify and optimize the code.
The lean version has significantly reduced memory overhead,
44 bytes per particle compared to 84 for standard Gadget.
This reduction allows the simulations to fit on a smaller
number of processors, affording better scaling. These simulations
require 50-100k CPU hours depending on the parameters
and resolution of the computation, and generate
up to 10 TB of total data output.
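
To make the memory saving concrete, the short calculation below multiplies the quoted per-particle footprints by the particle count of a 2048^3 run. It is only a back-of-the-envelope sketch: it counts per-particle storage alone and ignores tree structures, communication buffers and other overheads, and the function name is hypothetical.

```python
def particle_memory_gb(n_per_side, bytes_per_particle):
    """Aggregate particle storage in GB (10^9 bytes) for an n_per_side^3 run."""
    n_particles = n_per_side ** 3
    return n_particles * bytes_per_particle / 1e9


lean = particle_memory_gb(2048, 44)      # lean Gadget, 44 bytes per particle
standard = particle_memory_gb(2048, 84)  # standard Gadget, 84 bytes per particle

print(f"lean:     {lean:.0f} GB")
print(f"standard: {standard:.0f} GB")
print(f"saving:   {100 * (1 - lean / standard):.0f}%")
```

Under these assumptions the lean layout needs a little under half the aggregate particle memory of the standard one, which is what allows the same run to fit on fewer nodes.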

3. WORKFLOW ABSTRACTIONS 

The simulation codes discussed in Section 2 are executed
on large-scale XSEDE resources managed by batch resource
managers. The heterogeneity and complexity of interfacing
with these resources slow computational scientists
in harnessing the vast amount of available computing power.
eScience workflow systems abstract away these complexities
and enable the use of innovations made in computational
middleware. Scientific workflows are one of the prominent
abstractions that allow scientists to carry out their scientific
discovery and experimentation without having to worry
about the underlying complexity. These abstractions, while
lowering the barrier to entry and the learning curve, also
relieve scientists of the inefficient task of monitoring
long-running jobs by hand.

^10 For Adding Density-Determined GAlaxies to Lightcone
Simulations.



To build our cosmological workflow, we leverage the experience
and software developed by the Open Gateways Computing
Environments project [iS], facilitated by the XSEDE
Extended Collaborative Support Services. The workflow
infrastructure is based upon the Apache Airavata [16] framework.
We briefly describe the integration of the
simulation codes with the workflow infrastructure and the specific
customizations made to the framework itself. Further
details about the framework and its comparison to other
workflow solutions are discussed in [16].

The Airavata workflow system is primarily targeted to support
long-running scientific applications on computational
resources. XBaya, Airavata's graphical workflow tool,
allows composition, execution and monitoring of workflows.
The Airavata workflow engine requires these applications
to be raised to a common abstraction that can be
accessed using a standard protocol. The Airavata Generic
Application Factory (GFac) component bridges this gap between
applications and the workflow system by providing
a network-accessible web service interface to the scientific
application.

3.1 Implementation

Once the simulation codes are deployed on XSEDE compu- 
tational resources, we register descriptions of these appli- 
cations with the Apache Airavata registry service. These 
descriptions are used by the Airavata GFac component to 
generate the artifacts required to expose the application as 
a service. The workflow developer can access these wrapped 
application services and construct workflows and orchestrate 
executions on target compute resources. The resulting work- 
flow abstractions reduce human inefficiencies by providing a 
uniform interface for the scientist and hiding unnecessary 
complexities. 

To illustrate the construction of a cosmological workflow, we
describe the development of the N-body simulation
workflow illustrated in Figure 1. First,
the nature of each application, its execution characteristics,
and its input and output data are analyzed. The application
meta-information, including the executable location, its
type (serial or MPI), and its inputs and outputs, is described
and registered with the Airavata registry. This process was
followed for the following four applications.

BCC Parameter Maker 

This initial setup code is written as a Python script
and prepares necessary configurations and parameter
files for the workflow execution. This simple script is
forked on the XSEDE Ranger job management nodes.

CAMB 

The CAMB (Code for Anisotropies in the Microwave
Background) [14] application computes the power spectrum
of dark matter, which is necessary for generating
the simulation initial conditions. This application is
a serial FORTRAN code. The output files are relatively
small ASCII files describing the power spectrum.

2LPTic 

The Second-order Lagrangian Perturbation Theory initial
conditions code [9] (2LPTic) is an MPI-based C code
that computes the initial
conditions for the simulation from parameters and an
input power spectrum generated by CAMB. The output
of this application is a set of binary files that vary
in size from ~80-250 GB depending on the simulation
resolution.

LGadget 

The LGadget simulation code is an MPI-based C code
that uses a TreePM algorithm to evolve a gravitational
N-body system [22, 23]. The outputs of this step are
system state snapshot files, as well as light cone files,
and some properties of the matter distribution, including
diagnostics such as total system energies and momenta.
The total output from LGadget depends on
resolution and the number of system snapshots stored,
and approaches 10 TB for large DES
simulation volumes.

After all the above applications are registered, the Blind
Cosmology Challenge Workflow is constructed using Airavata
XBaya. The resultant workflow graph is shown in Figure 2.
The workflow provides capabilities to configure the
generation of initial conditions as well as the full N-body
simulation components of the BCC process.
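
The registration and composition steps can be pictured with a small, purely illustrative sketch. The `AppDescriptor` class, its fields, and the input/output labels below are hypothetical stand-ins, not the Airavata registry schema or API; the sketch only shows how declared inputs and outputs are enough to derive the order in which the four applications must run.

```python
from dataclasses import dataclass, field


@dataclass
class AppDescriptor:
    """Hypothetical application description (not the Airavata schema)."""
    name: str
    kind: str  # "serial" or "mpi"
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


# Deliberately listed out of order; the data labels are illustrative.
APPS = [
    AppDescriptor("LGadget", "mpi",
                  ["initial_conditions", "parameters"],
                  ["snapshots", "lightcones"]),
    AppDescriptor("CAMB", "serial", ["parameters"], ["power_spectrum"]),
    AppDescriptor("2LPTic", "mpi",
                  ["power_spectrum", "parameters"], ["initial_conditions"]),
    AppDescriptor("BCC Parameter Maker", "serial", [], ["parameters"]),
]


def execution_order(apps):
    """Order applications so each runs after the producers of its inputs."""
    produced, ordered, pending = set(), [], list(apps)
    while pending:
        ready = [a for a in pending if all(i in produced for i in a.inputs)]
        if not ready:
            raise ValueError("some inputs are never produced")
        for app in ready:
            ordered.append(app.name)
            produced.update(app.outputs)
            pending.remove(app)
    return ordered


order = execution_order(APPS)
print(" -> ".join(order))
```

A graphical composer lets the scientist draw these dependencies by hand; the point of the sketch is that the registered descriptions already carry the information needed to check them.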

3.2 Workflow System Enhancements 

Iterative execution support for long running applications.
The N-body simulation requires multiple days of
execution, but the XSEDE Ranger cluster limits wall time to a
maximum of 48 hours. To work within this limit, the workflow
infrastructure has to support iteration, so that the job
can be broken into multiple 48-hour increments
that harness the checkpoint-restart capabilities of the application.
These capabilities required sophistication beyond
blind restarts, in order to account for application execution
patterns and exception handling. They
could potentially be matured into formal Do-While construct
semantics for workflow engines.
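
The iterative pattern described above can be sketched as a do-while loop around a wall-time-limited job. This is a toy model under stated assumptions: `run_segment` is a hypothetical stand-in for one batch submission that restarts from the last checkpoint, and the real workflow must additionally inspect application output and handle failures rather than resubmit blindly.

```python
WALL_LIMIT_HOURS = 48  # queue limit quoted in the text


def run_segment(state, wall_limit=WALL_LIMIT_HOURS):
    """Stand-in for one batch job: resume from checkpoint, run up to the limit."""
    work = min(wall_limit, state["hours_needed"] - state["hours_done"])
    state["hours_done"] += work  # the "checkpoint" is the updated state
    state["finished"] = state["hours_done"] >= state["hours_needed"]
    return state


def run_to_completion(hours_needed, max_submissions=10):
    """Submit segments until the run reports completion; return the job count."""
    state = {"hours_needed": hours_needed, "hours_done": 0, "finished": False}
    submissions = 0
    while True:  # do-while: always submit at least once
        state = run_segment(state)
        submissions += 1
        if state["finished"]:
            return submissions
        if submissions >= max_submissions:
            raise RuntimeError("run did not finish within the submission budget")


print(run_to_completion(130))  # a 130-hour run split into 48-hour segments
```

The submission budget is an illustrative safeguard against a run that never reports completion; it is not a feature described in the text.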



Output Transfers. The workflow executions tend to produce
terabytes of data that reside on the cluster scratch file
systems and have to be persisted for longer durations.
Data movement to archival systems like TACC Ranch
for long-term storage has to be provided. Moving large files
is a non-trivial process. Even though many
advances have been made in this area, seamless,
reliable data transfers are still challenging. Emerging
solutions like Globus Online [1], the GridFTP client API [2] and
bbcp [3] are potentially viable options. For test runs, we leverage
the Ranch archival system mounted on Ranger and use bbcp
to copy the workflow outputs. We have yet
to explore this as a solution for production executions. To
transfer data to the remote post-processing locations, Globus
Online and the GridFTP client are viable options.

4. RESULTS 

Toward our goal of fifty cosmological simulations over two 
years, we have completed seven on XSEDE resources in the 
final quarter of 2011 and the first quarter of 2012. Most 
of these simulations were run 'by hand,' while the last was 
performed using the XBaya workflow environment. We
summarize the completed simulations in Table 1.

For a given cosmology, we generate four N-body simulations
in nested volumes, consisting of three large-volume realizations
with 2048^3 particles and one smaller volume of 1400^3
particles. This approach allows a better match to halo mass
selection imposed by the magnitude-limited nature of the
DES galaxy sample. As indicated in Table 1, the mass resolution
varies by nearly a factor of 60 from our smallest to
largest volumes. A halo resolved by a minimum of 100 particles
ranges from a mass of 3 x 10^12 h^-1 M_⊙ in the near-field
simulation to 2 x 10^14 h^-1 M_⊙ in the far-field^11. The former
is roughly the mass of our Milky Way galaxy's halo while the
latter corresponds to the mass scale of clusters of galaxies.
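
The resolution figures quoted above follow from the particle masses listed in Table 1, as the short check below shows. It assumes those masses are expressed in units of 10^12 h^-1 M_⊙, which is an inference from the Milky Way and cluster mass scales mentioned in the text rather than something the table states legibly.

```python
# Particle mass per box side length L (Gpc), from Table 1.
# Units assumed to be 10^12 h^-1 Msun (inferred, see lead-in).
PARTICLE_MASS = {1.05: 0.027, 2.60: 0.131, 4.00: 0.476, 6.00: 1.650}

MIN_PARTICLES = 100  # minimum particle count for a resolved halo

min_halo_mass = {L: MIN_PARTICLES * m for L, m in PARTICLE_MASS.items()}
resolution_ratio = PARTICLE_MASS[6.00] / PARTICLE_MASS[1.05]

for L, mass in min_halo_mass.items():
    print(f"L = {L:4.2f} Gpc: minimum halo mass = {mass:6.1f} x 10^12 h^-1 Msun")
print(f"mass resolution ratio (largest / smallest box): {resolution_ratio:.0f}")
```

The smallest box resolves halos of a few times 10^12 and the largest box halos of roughly 2 x 10^14 in these units, with a ratio of about 60 between the two.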

^11 Here, h denotes Hubble's constant, H_0, in dimensionless
form, h = H_0 / (100 km s^-1 Mpc^-1).



Figure 3: Full-sky image of the dark matter density in a thin radial slice (50-75 Mpc distance from observer) taken from 
our 1050 Mpc ΛCDM simulation. Color maps the local matter density relative to the mean value on a logarithmic scale
ranging from -1 (blue) to roughly 500 (red). 



Table 1: Summary of completed simulations, giving the side
length L (in Gpc) of the periodic cube, number of particles
N_part, particle mass M_part (in 10^12 h^-1 M_⊙), number of
simulations run N_run, kiloSUs used for the run as well as any
completed postprocessing, and the amount of data generated in
terabytes.

L (Gpc)   N_part   M_part   N_run   kSU   Data (TB)
1.05      1400^3   0.027    2       121    5.4
2.60      2048^3   0.131    2       284   16.8
4.00      2048^3   0.476    2       149   16.8
6.00      2048^3   1.650    1        95    8.4
All                         7       649   47.4

Each simulation produces lightcone outputs centered on each
of the eight corners of the computational volume. By employing
the periodic boundary conditions of the computational
domain, we can stitch these octants of sky into a single
4π representation of the full past lightcone of a hypothetical
observer placed at the origin of the simulation. A map of
the resultant structure in a thin radial slice of synthetic sky
is shown in Figure 3. Along with these lightcone outputs,
we also record snapshots of the particle configuration in the
full volume at 20 epochs, leading to an overall data output
of 8.4 TB for the 2048^3-particle runs.

The combined lightcone files for a single cosmology, ~2 TB
of data, are transferred manually via Globus Online to SLAC
for the post-processing steps, illustrated in Figure 1, that
create DES galaxy catalogs. We first identify dark matter
halos using a new algorithm, dubbed ROCKSTAR [7], that
uses a direct socket-to-socket task-scheduling approach to
operate efficiently on sub-regions of large simulations. We
determine a local Lagrangian density by computing the distance
to the Nth nearest particle, where N is chosen to enclose
a mass of ~10^13 M_⊙, roughly the transition mass above
which halos host more than one bright galaxy. With the
dark matter halos and Lagrangian density estimate in place,
the ADDGALS algorithm creates a synthetic galaxy catalog
for science analysis. Finally, gravitational lensing shear
is computed from the lightcone matter distribution, and its
effect on galaxy images recorded. A single post-processed
galaxy catalog, with 102 parameters per galaxy, is a 0.5
TB dataset.
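
The Nth-nearest-neighbour density estimate can be sketched as follows. This is a brute-force illustration on a handful of points, assuming the density is taken as the mass of N particles divided by the volume of the sphere that reaches the Nth neighbour; the exact estimator used in the pipeline is not specified in the text, and a production code would use a spatial tree rather than sorting all distances.

```python
import math


def nth_neighbor_distance(points, index, n):
    """Distance from points[index] to its n-th nearest neighbour (brute force)."""
    dists = sorted(math.dist(points[index], p)
                   for j, p in enumerate(points) if j != index)
    return dists[n - 1]


def local_density(points, index, n, particle_mass):
    """Mass of n particles over the volume of the sphere reaching the n-th one."""
    r = nth_neighbor_distance(points, index, n)
    return n * particle_mass / (4.0 / 3.0 * math.pi * r ** 3)


def neighbors_for_mass(mass_scale, particle_mass):
    """Number of particles N whose total mass is closest to mass_scale."""
    return max(1, round(mass_scale / particle_mass))


# Toy configuration: four equal-mass particles on a line.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(nth_neighbor_distance(points, 0, 2))
print(local_density(points, 0, 2, particle_mass=1.0))
```

Choosing N from a fixed enclosed mass, as `neighbors_for_mass` does, makes the smoothing scale depend on the particle mass of each simulation box.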

4.1 Efficiency gains with XBaya 

The XBaya workflow shown in Figure 2 was tested and refined
using smaller simulations over the period Oct 2011 to
Mar 2012. We transitioned to production use for our most
recent simulations. One workflow-managed simulation has
run to completion, while a second job crashed because of a
hardware problem on TACC Ranger.

Even with this limited information, we can compare the efficiency
of running the required jobs under XBaya to those
previously run manually. Jobs were submitted to the long
queue at TACC Ranger, which has maximum resource limits
of 1024 processors and 48 hour runtime. In Table 2, Total
Time is the wallclock time interval for the entire production
process, while CPU Time gives the sum of the run times
of the required jobs. Times reflect the full N-body production
process, from generating initial conditions all the way
through to completion of the final N-body timestep. Efficiency
is the ratio of CPU to Total times, with 100% representing
the ideal scenario of running without interruption.



Table 2: Comparison of Manual and Workflow-enabled production
times

Run        Total Time    CPU Time      Efficiency
Manual     8:15:33:05    4:07:24:10    50.0%
Manual     4:05:39:07    2:17:50:06    64.8%
Workflow   2:09:53:23    2:05:28:09    92.4%
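
The efficiency column can be recomputed directly from the two time columns. The sketch below assumes the times are written as days:hours:minutes:seconds, which is an interpretation of the table rather than something it states.

```python
def to_seconds(stamp):
    """Convert a 'd:hh:mm:ss' string (assumed format) to seconds."""
    days, hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def efficiency_percent(total, cpu):
    """Efficiency as defined in the text: CPU time over total wallclock time."""
    return 100.0 * to_seconds(cpu) / to_seconds(total)


RUNS = [
    ("Manual", "8:15:33:05", "4:07:24:10"),
    ("Manual", "4:05:39:07", "2:17:50:06"),
    ("Workflow", "2:09:53:23", "2:05:28:09"),
]

efficiencies = [efficiency_percent(total, cpu) for _, total, cpu in RUNS]
for (label, _, _), eff in zip(RUNS, efficiencies):
    print(f"{label:8s} {eff:5.1f}%")
```

Under that reading, the listed times give efficiencies of roughly 50%, 65% and 92% for the three runs.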



The first two rows of Table 2 list different manually
processed simulations. The first is a large simulation that
needed four total submissions to the queue: one for initial
conditions, and three for N-body computation. The second
row was a smaller job requiring three submissions, one for initial
conditions and two for N-body. These runs are relatively
inefficient because, each time the wallclock limit is reached,
the user is notified via email, and then must log back into
the cluster to submit the next job. If the wallclock limit
is reached at an inconvenient time, considerable time can
elapse before the next submission. More submissions tend
to drive up the inefficiency.

The last row shows the efficiency of running the full production
process via the XBaya workflow environment. By
enabling immediate submission of jobs when the preceding
job finishes, the efficiency improves to 92%, well above the
50-65% found for manual processing. Under the workflow,
the only time not spent computing is spent waiting in the
queue on the cluster. That is, our job production becomes
limited only by the instantaneous compute resources
available on TACC Ranger.

We found that the workflow can also help prevent errors in 
simulation set-up. Our first production level workflow was 
designed to have the same parameters as a pair of manually 
completed simulations. We soon found that the latter, 'by 
hand' simulations were inconsistent with the workflow sim- 
ulations. Investigating the source of the inconsistency, we 
found that the workflow was correct, and that a parameter 
had been mistakenly set to a wrong value in the original 
simulations. This shows the workflow's value in reducing 
the risk of simple human run-time errors. 

4.2 DES science 

Synthetic surveys from the first set of simulations are cur- 
rently in use by DES science groups. We provide here two 
examples of analyses from cluster and weak lensing science 
groups. 

Members of the Galaxy Cluster Working Group are using 
the synthetic galaxy catalogs to evaluate different methods 
for identifying the massive halos that host clusters. Because 
5-band optical photometry provides relatively crude distance 
information for each single galaxy, the ability to identify lo- 
calized spatial clusters is compromised by poor depth reso- 
lution. Different cluster finding algorithms have methods to 
mitigate this loss of information, but none has been applied 
to a large survey with the depth of DES. 

Model galaxies are assigned to specific, unique dark matter 
halos, so an ideal cluster finder would return the list of orig- 
inal halos with all members intact. In practice, the distance 
errors, incorrect choice of cluster location and other effects 
Figure 4: Purity as a function of the observed mass proxy
(top) and completeness as a function of true halo mass (bottom)
of the redMaPPer galaxy cluster finding algorithm applied
to the first synthetic DES sky catalog. Colors denote
the redshift intervals shown in the legend.



imply that the set of clusters found does not perfectly match 
the input set of halos. However, a correspondence map be- 
tween cluster and halo sets can be derived using joint galaxy 
membership criteria. The utility of a cluster finder can then 
be characterized by two parameters: i) purity, the fraction
of clusters that correspond to genuinely massive halos, and
ii) completeness, the fraction of halos that are found by the
cluster finder. The ideal value for both parameters is one.
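
Given a correspondence map between clusters and halos, the two statistics reduce to simple counting. The sketch below assumes the map is available as a dictionary from cluster identifier to matched halo identifier (or `None` for an unmatched cluster); the membership-based matching itself is not shown, and all identifiers are made up for illustration.

```python
def purity_and_completeness(cluster_to_halo, halo_ids):
    """Return (purity, completeness) for a cluster-to-halo correspondence map.

    cluster_to_halo maps each detected cluster to a halo id or None.
    halo_ids lists the genuine massive halos in the input catalog.
    """
    halos = set(halo_ids)
    matched_halos = {h for h in cluster_to_halo.values() if h in halos}
    pure_clusters = sum(1 for h in cluster_to_halo.values() if h in halos)
    purity = pure_clusters / len(cluster_to_halo)
    completeness = len(matched_halos) / len(halos)
    return purity, completeness


# Toy example: four detected clusters, five input halos.
matches = {"c1": "h1", "c2": "h2", "c3": None, "c4": "h4"}
halo_catalog = ["h1", "h2", "h3", "h4", "h5"]

purity, completeness = purity_and_completeness(matches, halo_catalog)
print(purity, completeness)
```

In practice both quantities are evaluated in bins of mass proxy, halo mass and redshift, as in Figure 4, rather than as single numbers.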

Figure 4 shows recent purity and completeness measurements
from the redMaPPer cluster finder [19]. The redMaPPer
cluster finder works on so-called red-sequence galaxies,
galaxies with evolved stellar populations that tend to be
found in massive halos. The algorithm is a Bayesian method
that assigns galaxies a cluster membership probability based
on the galaxy's color, and the density of nearby galaxies of
a similar color. Clusters are identified as clumps of similar
high-likelihood member galaxies. Figure 4 shows that
redMaPPer is > 80% pure and has similarly high completeness
above a redshift-dependent minimum halo mass.



In a related project, members of the Weak Lensing Working
Group have been investigating how best to estimate the
mass of an observed cluster based on the weak lensing distortion
of background galaxies. Light from faint background
galaxies is bent by the gravitational field of a massive halo,
distorting the shape (shear) and density (magnification) of
the galaxies. Using a maximum likelihood estimator [21]
to fit the synthetic observations, and comparing to the true
halo masses from the simulation, the group can understand
how best to estimate masses and also provide feedback on
the model used to generate the synthetic observations. A recent
calibration plot comparing estimated and true masses
is shown in Figure 5.

Figure 5: Comparison of the mass inferred from weak lensing
shear analysis (y-axis) to the true halo mass (x-axis)
for several thousand galaxy clusters identified in the first
synthetic DES sky survey. The dotted line is the identity
relation.

4.3 Future Directions

Implementing the Airavata workflow for this project has
entailed some overhead. Scripts that set up the input parameter
files needed to be developed, and new features were
added to the existing codebase so that our applications would
integrate more effectively with the workflow framework.
Interaction between the co-authors of this document (domain
scientists along with TeraGrid AUSS team members) was
essential to achieving a production-level service. The effort
invested has been worthwhile, with a significant gain in
realized efficiency (Table 2) for our first production run.

In the near term, we have applied for continued XSEDE
ECSS support aimed at integrating the postprocessing and
catalog production stages into the Airavata workflow (see
Figure 1). We also want to integrate data movement into
the workflow. In addition, we have requested time at two
XSEDE facilities (TACC Ranger and UCSD Trestles), and
would like to expand the workflow logic to choose execution
location in real time, based on queue loads.



We also would like to use Airavata to improve our provenance
practices. The workflow system can capture provenance,
including information such as when the data set was
created, by whom, where, and with what application version
and which input parameters. Improved provenance can
enable broader forms of sharing, reuse, and long-term preservation
of our simulations and the resultant galaxy catalogs.
Additional development could include standardizing an API
for simulation parameter input and output, so that other
codes could be easily implemented in the workflow, such as
N-body models that employ modified gravity [17, 15].
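
As an illustration of the kind of record such provenance capture might produce, the sketch below serializes the items listed above (when, by whom, where, application version, input parameters) to JSON. The `ProvenanceRecord` class, its field names, and the example values are hypothetical; they do not represent Airavata's provenance model.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    dataset: str
    created_by: str
    resource: str
    application: str
    version: str
    parameters: dict = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


record = ProvenanceRecord(
    dataset="example-lightcone-output",   # placeholder dataset name
    created_by="example-user",            # placeholder user
    resource="TACC Ranger",
    application="LGadget",
    version="example-version",            # placeholder version label
    parameters={"box_size_gpc": 1.05, "particles_per_side": 1400},
)

serialized = json.dumps(asdict(record), sort_keys=True, indent=2)
print(serialized)
```

Storing such a record alongside each output would make it possible to trace a galaxy catalog back to the exact simulation parameters that produced it.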



In the longer term, we could also expand our scope, general- 
izing our galaxy catalog construction process into a science 
gateway that would support broader classes of astrophysical 
studies. The optical catalogs we create could be augmented 
by synthetic surveys at other wavelengths, from radio to X- 
ray, and our focus on galaxies could be expanded to include 
quasars, galactic stars, and other astrophysical objects. 

5. SUMMARY 

To meet the challenge of interpreting Big Astronomical Data 
for cosmological and astrophysical knowledge, new modes of 
study that incorporate model expectations derived from so- 
phisticated simulations will be required. Growing demand 
for simulated data products motivates the automation of 
simulation production methods within grid-aware workflow
environments. This work represents a first step in that direction.

We are using XSEDE resources to produce multiple syn- 
thetic sky surveys of galaxies and large-scale structure in 
support of science analysis for the Dark Energy Survey. To 
scale up our production to the level of fifty 10^10-particle simulations,
we have embedded production control within the
Apache Airavata workflow environment, resulting in a significant
increase in production efficiency compared to manual
management.